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Abstract 

Let a; e M be given. As we know the, amount of bits needed to binary code 
X with given accuracy (h e M) is approximately in.h{x) a; log2(max{l, |||}). 
We consider the problem where we should translate the origin a so that the mean 
amount of bits needed to code randomly chosen element from a realization of a 
random variable X is minimal. In other words, we want to find a e M such that 

R 9 a ^ E{mh{X - a)) 

attains minimum. 

We show that under reasonable assumptions, the choice of a does not depend 
on h asymptotically. Consequently, we reduce the problem to finding minimum 
of the function 

R9a^ / ln{\x - a\)f{x)dx, 
Jr 

where / is the density distribution of the random variable X. Moreover, we pro- 
vide constructive approach for determining a. 
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1. Introduction 

Data compression is usually achieved by assigning short descriptions (codes) 
to the most frequent outcomes of the data source and necessarily longer descrip- 
tions to the less frequent outcomes [HI El 13 . 

For the convenience of the reader, we shortly present theoretical background 
of this approach. Let p = (po, ■ ■ ■ ,Pn-i) be a probability distribution for a dis- 
crete random variable X. Assume that is the length of the code of Xi for 
i = 0,...,n — 1. Then the expected number of bits is given by '^^Pili- The 
set of possible codeword with uniquely decodable codes is limited by the Kraft 
inequality 2^'' < 1. It is enough to verify that lengths which minimize piU 
are given by U = logs Pi- We obtain that minimal amount of information per one 
element in lossless coding is Shannon entropy [IJ defined by 

= '^-Pi'^og2Pi. 

By this approach various types of lossless data compression were constructed. 
An optimal (shortest expected length) prefix code for a given distribution can be 
constructed by a simple algorithm discovered by Huffman [5|. 

If we want to consider continuous random variables and code with given max- 
imal error h we arrive at the notion of differential entropy [HI . Let / : M — t- M be 
a continuous density distribution of the random variable X, and let / : M — )• M be 
the function of the code length. We divide the domain of / into disjoint intervals 
of length h. Let Xh ■= \ \hx\ be the discretization of X. The Shannon entropy 
of Xh can be rewritten as follows 

i/(X,)^5^-/(a;,)/ilog2(/(a;,)/i) 

= X] -f{xi) logs {f{xi)) h - logs {h) ^ f{xi)h 

~ y"-/(a;)log2 (/(a;))dx - logs (/i) j f{x)dx '^^^ 

= J -f{xi) logs {fi^t)) dx - logs (h) ■ 

By taking the limit of H{Xh) + logs(/i) as /i — )• 0, we obtain the definition of 
the differential entrop>Q 

H{f):=- j /(x)logs(/(x))rfx. (2) 



Very often In is used instead of logj, also in this article we use this convention. 
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In this paper, we follow a different approach. Instead of looking for the best 
type of coding for a given dataset, we use standard binary coding and we search 
for the optimal center a of the coordinate system so that the mean amount of bits 
needed to code the dataset is minimal. The main advantage of this idea is that we 
do not have to fit the type of compression to a dataset. Moreover, codes are given 
in very simple way. This approach allows to immediately encrypt and decrypt 
large datasets (we use only one type of code). Clearly, classical binary code is far 
from being optimal but it is simple and commonly used in practise. 

The number of bits needed to code x G Z, by the classical binary code, is 
given by 

m(x) ~ log2(max{l, |x|}). 

Consequently, the memory needed to code x G M with given accuracy h is ap- 
proximately 




Our aim is to find the place where to put the origin a of the coordinate system so 
that the mean amount of bits needed to code randomly chosen an element from a 
sample from probability distribution of X is minimal. In other words, we want to 
find a G M such that 

E(m/i(X — a)) = / m/i(x — a)f{x)dx 
Jr 

attains minimum, where / is the density distribution of the random variable X. 

Our paper is arranged as follows. In the next section we show that under 
reasonable assumptions, the choice of a does not depend on h asymptotically. 
This reasoning is similar to the derivation of the differential entropy ([T]) . 

In the third section, we consider the typical situation when the density distri- 
bution of a random variable X is not known. We use a standard kernel method 
to estimate the density /. Working with the estimation is possible, but from the 
numerical point of view, complicated and not effective. So in the next section we 
show reasonable approximation which has better properties. 

In the fourth section, we present our main algorithm and in Appendix B we 
put full implementation. 

In the last section, we present how our method works on typical datasets. 



^In the classical binary code we use one bit for the sign and then the standard binary represen- 
tation. This code is not prefix so we have to mark ends of words. Similar coding is used in the 
decimal numeral system. 
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2. The kernel density estimation 



As it was mentioned in the previous section, our aim is to minimize, for a 
fixed h, the function a — )■ E(mh{X — a)). In this chapter, we show that the choice 
of a (asymptotically) does not depend on h. In Theorem |2.1[ we use a similar 
reasoning as in the derivation of differential entropy ([T]) and we show that it is 
enough to consider the function 



Mf{a) 



\n{\x — a\)f{x)dx. 



Theorem 2.1. We assume that the random variable X has locally bounded density 
distribution f : R ^ R. If Mf{a) < oo for a eR, then 



lim 



EimhiX - a)) - -—Mf{a) + log^ih) 



ln(2)- 

Proof. Let a G M be fixed. Then 

E(m;,(X - a)) = 



forae R. 



mh{x — a)f{x)dx 
logs 

{a—h,a+h) 



\x — a 



logs (^max { 1, 

f{x)dx 



\x — a\ 



f{x)dx 



log2 (|x — a\) f{x)dx + 



\{a—h,a+h) 

logs (k - a\) f{x)dx 
logs f{x)dx - 



logs (h) f{x)dx 

a—h,a+h) 

logs ~ '^D f{x)dx + 



(a—h,a+h) 

logs (^) fix)dx. 

' {a~h,a+h) 

Since the function / is a locally bounded density distribution so 



logs i\x - a|) f{x)dx 



' (a—h,a+h) 

Consequently 



logs (h) f{x)dx = 0. 



{a—h,a+h) 



lim 



E(m,(X - a)) 



ln(2) 



M^(a) + log2(/i) 



for a G 



□ 
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As we see, when we increase the accuracy (h — i- 0) of coding the shape of the 
function E{mh{X — a)) stabilizes (modulo subtraction of log2(/;-)). 

Example 2.1. Let f be a uniform density distribution on interval [— |, Then 

Mf{a) = / \n{\x — a\)f{x)dx = / \n{\x — a\)dx. 

Moreover, since 

/l„(x-a)dx = ln(.-a)x-Mx-a)a-x-a. 



then we have 



Mf{a) 



In (II - a\) (I - a) + In (II + a\) (| + a) - 1 for \a\ ^ \, 



-1 



for \a\ 



Function Mf(a) is presented in Fig. [7] It is easy to see that min (M/(a)) = 0. 




Figure 1: Function Mf{a) constructed for uniform density on 
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hi a typical situation, we do not know the density distribution / of random 
variable X. We only have a sample S = (xi, . . . , x^) from X. To approximate 
/ we use kernel method tSJ. For the convenience of the reader and to establish 
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the notation we shortly present this method. The kernel : M — M is simply a 
function satisfying the following condition 

K{x)dx = 1. 

Usually, i^' is a symmetric probability density function. The kernel estimator of / 
with kernel K is defined by 



1 / 



h 



where h is the window widthj^ Asymptotically optimal (for — )■ oo) choice of 
kernel K in class of symmetric and square-integrable functions is the Epanech- 
nikov kernel 

K{x) = 



1(1 -x^) 

for |a;| > 1. 



^(1 — x^) for |x| < 1, 



Asymptotically optimal choice of window width (under the assumption that den- 
sity is Gaussian) is given by 



h ^ 2.35sN 5 where s 



^ {xj - m{S)y 



Thus our aim is to minimize the function 

' Xi — h 

To compute Ms{a), we analyse the function L : M — t- M (see Fig. |2]) given by: 

3 

L: R 3 a ^ - J \n{\x - a\) (l - x"^) dx. 



^Also called "the smoothing parameter" or "bandwidth" by some authors. 
"^Often a rescaled (normalized) Epanechnikov kernel is used. 
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Figure 2: Function L. 



Lemma 2.1. We have 



1 ^ / 
Ms{a)^Hh) + -Y,L[ 



Xj — a 



Proof. By simple calculations we obtain 

Ms{a) = 



Nh ^ 



Xi—h 



ln(|x-a|)- I 1 



h 



dx — 



y 



dx 



X ^ hy + Xi 



1 



1 ^3 



1=1 



In 



y + 



i=l 



In h 



y + 



h 



{l-y')dy + ln{h) = ln{h) + -J2Li 
^ 1=1 ^ 



□ 



3. Approximation of the function L 

As it was shown in the previous section, the crucial role in our investigation is 
played by the function L and therefore, in this chapter we study its basic proper- 
ties. Let us begin with the exact formula for L. 



Proposition 3.1. The function L is given by the following formula 

r(n)-f '2 H\l -a'\) + (fa - y) In (| i±f |) + ^ - | for \a\ + 1, 
^^"^ - \ ln(2) - I for \a\ = 1. 



Moreover, L is even, L(0) = — | and lim {L{a) — ln(|a|)) = 0. 

|a|— >-oo 



Proof We consider the case when a > 1 (by similar calculation, we can get this 
result for all a e R) 



/ln(x-a)(l-.^)d. 



ln(x — a) 1 — 

1 2; - -x^ 

x—a 3 



(1 \ fx— rX''' 
X x^ J ln(|a; — a|) — / — dx 
3 y J x-a 



— \n(x — a) ( — -x^ + X — a + -a^] — x + a + -x^ + -x^a + -xa? — —i 
3 o I 9 6 3 18 



Consequently 



L{a) 



\ ln(|l - a^l) + (fa - \a^) In (| |) + - | for |a| ^ 1, 
ln(2) - I for \a\ = 1. 



As a simple corollary, we obtain that L is even and L(0) = — |. To show the last 
property, we use the equaUty 



J ln{\x - a\) {1 - x^) dx = J ln{\x + a\) {l - x"^) dx . 



Then for lal > 1, we obtain 



J \n{\x - a\) {1 - x"^) dx = ^ J {ln{\x - a\) + \n{\x + a\)) {l - x^) dx 
= I £ In (a^) (1 - x^) dx + ^ In (l - (l - dx = 
= i.n(a^) + 5£.n(l-g(l-.V- 
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Since 



-1 



In 1 - 



(l — x^) dx E 
1 



mm 



ce[-i,i] v2 



In 1 - 



X 



max 



xel-1,1] V 2 



In 1 - 



In 1 - 



we get 



> lim (L(a) - ln(a)) > lim - In 1 



0. 



□ 



From the numerical point of view, the use of the function L (Fig. [2]) is compli- 
cated and not effective. The main problem is connected with a possible numerical 
instability for a close to 1. Moreover, in our algorithm we use the first deriva- 
tive (more information in the next chapter and Appendix A) by considering the 
function 



In 



a + 



Va+l 



+ 2y/a for a 7^ 1, 
for a = 1. 



In this case, we have numerical instability for a close to 0, 1. Thus, instead of L, 
we use Cauchy M-estimator [[TOl L : M — M which is given by 



L(a) := -ln(e"5 +a^). 

The errors caused by the approximation are reasonably small in relation to those 
connected with kernell estimatior(3 

Observation 3.1. The function L is analytic, even, 1^(0) = L{0) and lim (L(a) — 

|a|— >oo 

L{a)) = 0. 

Consequently, the problem of finding the optimal (respectively to the needed 
memory) center of the coordinate system can be well approximated by searching 
for the global minimum of the function 



Msia) 



1 ^ 



i=l 



h 



for a G 



^As it was said in this method one assume that dataset is realization of Gaussian random vari- 
able while usually, in practise, does not have to. 



9 



Figure 3: Comparison of functions L and L. 



4. Search for the minimum 

In this section, we present a metliod of finding tlie minimum of the function 
Ms{a). We use the robust technique which is called M-estimator 

Remark 4.1. For the convince of the reader we shortly present the standard use 
of the M-estimator's method. Let {xi, . . . , x„} be a given dataset. We are looking 
for the best representation of the points 



Normally we choose barycentre of the data but elements which are fare from the 
center usually courses undesirable effect. The M-estimators try to reduce the 
effect of outliers by replacing the squares by another function ( in our case Cauchy 
M-estimator I J()] j 



where L is a symmetric, positive-definite function with a unique minimum at zero, 
and is chosen to be less increasing than square. Instead of solving directly this 
problem, one usual implements an Iterated Reweighted Least Squares method 
(IRLS) [see Appendix A]. In our case we are looking for 



mm 



a 




2 




a 




a 
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where L is interpreted as a function which describes the memory needed to code 
Xi with respect to a. 

Our approach based on Iterated Reweighted Least Squares method (IRLS), is 
similar to that presented in [jH and [O. For convenience of the reader, we include 
the basic theory connected with this method in Appendix A. 

In our investigations the crucial role is played by the following proposition. 

Corollary IRLS (see Appendix A). Let / : IR+ -> IR+, /(O) = Q be a given 
concave and differentiable function and let S = {xi, . . . ,xn} be a given data-set. 
We consider the function 

F(a) := J^/da-x^n foraeR. 

i 

Let a GMbe arbitrarily fixed and let 

Wi = f {\xi — a\'^) for i = 1 . . . N and a. 

Then 

F(a^) <F{a). 

Making use of the above corollary, by substitution a — > a^; in each step we 
come closer to a local minimum of the function F. It is easy to see, that L defined 
in the previous section, satisfies the assumptions of Corollary IRLS. Let S = 
(xi, . . . , xat) be a given realization of the random variable X and let 



h = 2.35N-r, 

The algorithm (based on IRLS) can be described as follows. 

Algorithm. 



N 



N 

y^^wjXj. 



E{xi - m{S)Y 
N -I 



11 



initialization 

stop condition 

£ > 
initial condition 

j = 

Gj = m{S) 
repeat 

calculate 




100 200 300 4O0 



Figure 4: Estimation of the density distribution for the Forex data. 

The first initial point can be chosen in many different ways. We usually start 
from the barycentre of the dataset because, for equal weights, the barycentre is the 
best approximation of the sample (for full code written in R Project, see Appendix 
B). 

Now we show how our method works in practise. 

Example 4.1. Let S be the sample of the index of USD/EUR from Forex stoke 
l^. The density obtain by the kernel method is presented at Fig. ^ As a result 
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of the algorithm we obtained alg_centre(S') = 238.4174. We compare our result 
with the global minimum argmin(M5) = 239.509 and the barycentre of data 
m{S) = 212.8004.- 

ininM^ = 4.0477, 

M5(alg_centre(5)) = 4.0478, 

Ms{m{S)) = 4.0768. 

As we see the difference between min M5 and M5(alg_centre(S')) is small. 
Moreover, the barycentre gives a good approximation of the memory centre but, 
as we see in next examples, the difference can be large for not uni-modal densities. 

In the next step we consider a random variables of the form 

X ■.= pi-Xi+p2-X2 

where Xi,X2 are two normal random variable^ or two uniform random vari- 
ablefl 

In Table [T] we present comparison of the result of our algorithm and global 
minimum of the function Ms where S is the realization of random variable X 
of size 500 with different parameters. As we see, in the second and the third 
columns the algorithm which uses the function L, gives a good approximation of 
the minimum for the function L. It means that the use of L from the third section 
is reasonable and causes minimal errors. 

In our cases, we obtained a good approximation of the global minimum. More- 
over, the difference between the minimum of the original function and the result of 
our algorithm is small (see fifth column). Consequently, we see that the barycentre 
of data sets is a good candidate for initial point. 

Clearly (see sixth column), we see that the barycentre of a data is not a good 
approximation of the memory center, especially in the situation of not uni-modal 
densities. 

5. Appendix A 

In the fourth section, we presented a method for finding the minimum of the 
function Ms (a). Our approach based on the Iterated Reweighted Least Squares 



*The normal random variables with means m and the standard deviation s we denote by N(^m,s) ■ 
^The uniform random variable on the interval [a, b] we denote by U[a,b]- 
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Ti-i • A 1 -1- Tin • An 




n 






' (— 1,1) ' "•^-''(1,1) 


-0 347 


-0 379 


00017 


00658 


"J-^-' ' (— 6,1) ' "•^-''(6,1) 


5 803 


5 676 


00079 


55374 




0.483 


0.416 


0.00093 


0.01353 


0.3iV(_6,o.5) + 0.7iV(6,i) 


5.979 


5.863 


0.00089 


0.57296 


0.2iV(_2,o.5) + 0.8iV(3,2) 


2.721 


2.770 


0.00021 


0.05129 


0.6f/[_3,-i] + 0.4t/[o,i] 


-1.805 


-1.824 


0.00014 


0.19388 


0.4f/[_3,-i] + 0.6t/[o,i] 


0.364 


0.403 


0.00144 


0.41139 


0.3f/[_3,_i] + 0.7%!] 


0.433 


0.447 


0.00028 


0.44633 


0.2f/[_2,-i] + 0.8f/[i,2] 


1.403 


1.445 


0.00265 


0.38220 


0.2f/[_5,-2] + 0.8f/[3,4] 


3.464 


3.439 


0.00019 


0.50817 



Table 1: In this table we have following the notation: ~ alg_centre(S') and a. 
argmin(Ms). 



algorithm (IRLS). The method can be applied in statistic and computer since. We 
have used them to minimize the function Ms{a). Similar approach is presented in 
191 and [6|. For convenience of the reader, we show the basic theoretical informa- 
tion about IRLS algorithm. The main theorem, related with this method, can be 
formulated as follows: 

Theorem IRLS. Let : IR+ — IR+, /(O) = be a set of concave and differ- 
entiable function and let X = (xi) be a given data-set. Let a G M be fixed. We 
consider functions 

F{a)■.= Y,fi{\<^-^^?) foraeR 

i 

and 

H{a) := ^[fi{\a-Xi\'^)-f-{\a-Xi\'^)\a-Xi\'^]+f-{\a-Xi\'^)\a-Xi\'^ for a e M. 

i 

Then H{a) = F{d) and 

H>F. 

Proof. Let i be fixed. For simplicity, we denote / = /». 

Let us first observe that, without loss of generality, we may assume that x = 
(we can make an obvious substitution a — )■ a + x). 
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Then since all the considered functions are radial, it is sufficient to consider 

the one dimensional case when = 1. Thus, from now on we assume that = 1 
and X = 0. Since the functions are even, it is sufficient to consider the situation 
on M+. 

Concluding: we are given a e R+ and consider functions 

and 

h:R+3a^ [/(a^) - f'{a^)a^] + f{a^)a\ 

Clearly, h{a) = g{a). We have to show that h > g. We consider two cases. 

Let us first discuss the situation on the interval [0, a]. We show that g — his 
increasing on this interval, since coincide at a this makes the proof completed. 
Clearly g and h are absolutely continuous functions (since / is concave). Thus, to 
prove that g — his increasing on [0, a], it is sufficient to show that g' > h' a.e. on 
[0, a]. But / is concave, and therefore /' is decreasing, which implies that 

g'{a) = /'(a2)2a > f'{a'')2a = h'{a). 

So let us consider the situation on the interval [a, oo). We will show that g — h 
is decreasing on [a, oo). Since {g — h){a) — 0, this is enough. Note that 

{g - h)\a) = f'{^)2a - f\a')2a = 2a(/'(a2) - f'{a'j) < 0. 

Now, the assertion of the theorem is a simple consequence of the previous 
property. □ 

Now we can form the most important results: 

Corollary IRLS. Let f : IR+ IR+, /(O) = Q be a given concave and differen- 
tiable function and let S — {xi}^]^ be a given data-set. We consider the function 

F{a) ^^f{\a-Xi\^) /oraeR. 

i 

Let a EWbe arbitrarily fixed and let 

Wi — f'{\a — Xif) fori — l...N and a. 

Then 

F{a^)<F{a). 



N 



N 

E 

i=Q 



WiXi. 
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Proof. Let H be defined like in Theorem IRLS 



^(a) = 5^[/(|a-^i|')-/'(|a-^ir)l«-^ir]+/'(l«-^in|a-^«l'foi-aeM. 

i 

Moreover by Theorem IRLS we have 

F{a) = H{a). 
Function H is quadratic so the minimum is 



1 ^ 



and consequently 



F(a) = H{a) > H{a^). 



□ 



Making use of the above theorem, we obtain a simple method of finding a 
better approximation of the minimum. For given a G M, by taking weighted 
average (see Corollary IRLS) we find the point which reduces the value of the 
function. So, to find minimum, we iteratively calculate weighted barycentre of 
data set. 

6. Appendix B 

In this section we present source code of our algorithm written in R Project. 



#The derivative of the function bar L 
d_bar_L = function ( s ) {(exp(-8/ 3) + s )"(- 1) } 

#The function which describe a single iteration 
#S — data p — starting point 
iteration = f unction (p , S ) { 

suma=0 

weight=0 

v=sd(S) #standard deviation 
h = 2.35*v*(lengtli(S))"(-l/5) 
for(s in S){ 

weight _s=d_ bar _L(( abs(s-p) "2) / (h"2) ) 

suma=suma+ weight _ s * s 

weight=weight + weight _s 
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} 

suma/ weight 

} 

#The function which determine a local minimum 
#S — data 
#e — epsilon 

#N ~ maximal number of iterations 
look _minimum= function ( S , e ,N) { 

ans=mean(S) 

dist=l 

while (dist>e & N>0){ 

ans _n= iteration (ans ,S) 
N=N-1 

dist=abs(ans _n— ans ) 

ans = ans _n 

} 

ans 

} 



In our simulation we use e = 0.001 and = 50. 
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